Add more agent evals #961

tkattkat · 2025-08-12T17:57:50Z

part of STG-653

why

Adds more evals to agent

what changed

Added ~15 new evals

test plan

tested locally
tested on browserbase

changeset-bot · 2025-08-12T17:57:53Z

🦋 Changeset detected

Latest commit: a824aa6

The changes in this PR will be included in the next version bump.

greptile-apps

Greptile Summary

This PR adds 17 new agent evaluation tasks to the Stagehand evaluation suite as part of STG-653. The new evaluations test the AI agent's capabilities across diverse real-world scenarios including e-commerce (Amazon shoes, Google Shopping, UberEats), entertainment platforms (Steam Games, Apple TV), research tools (arXiv, Hugging Face, WolframAlpha), and various web services (GitHub, Google Maps, NBA trades on ESPN, hotel booking).

All new evaluation files follow the established pattern from the existing agent evaluation framework:

Navigate to target website using stagehand.page.goto()
Create an agent with dynamic provider selection based on model name (Claude uses "anthropic", others use "openai")
Execute specific instructions with defined step limits (typically 14-30 steps)
Evaluate success based on agentResult.success property
Include proper error handling, logging, and resource cleanup

The tasks are added to evals.config.json under the 'agent' category, integrating them into the existing evaluation pipeline. These evaluations expand test coverage to validate agent performance across complex multi-step workflows like checkout processes, search filtering, information extraction, and form filling on production websites.
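The five-step pattern listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `StagehandLike`, `AgentLike`, `providerForModel`, and `runAgentTask` are hypothetical names, the interfaces are reduced stand-ins for the real Stagehand types, and detecting a Claude model by substring match is an assumption.

```typescript
// Hypothetical, reduced shapes; the real Stagehand types are richer.
interface AgentResult { success: boolean; message?: string }
interface AgentLike {
  execute(opts: { instruction: string; maxSteps: number }): Promise<AgentResult>;
}
interface StagehandLike {
  page: { goto(url: string): Promise<unknown> };
  agent(config: { provider: string; model: string }): AgentLike;
  close(): Promise<void>;
}

type Provider = "anthropic" | "openai";

// Assumption: a Claude model is recognised by "claude" appearing in its name.
function providerForModel(modelName: string): Provider {
  return modelName.toLowerCase().includes("claude") ? "anthropic" : "openai";
}

async function runAgentTask(
  stagehand: StagehandLike,
  modelName: string,
  url: string,
  instruction: string,
  maxSteps = 20,
): Promise<{ success: boolean; error?: unknown }> {
  try {
    // 1. Navigate to the target website.
    await stagehand.page.goto(url);
    // 2. Create an agent, choosing the provider from the model name.
    const agent = stagehand.agent({
      provider: providerForModel(modelName),
      model: modelName,
    });
    // 3. Execute the instruction with a step limit.
    const agentResult = await agent.execute({ instruction, maxSteps });
    // 4. Report success based on agentResult.success.
    return { success: agentResult.success };
  } catch (error) {
    // 5. Errors count as a failed eval rather than crashing the run.
    return { success: false, error };
  } finally {
    // Resource cleanup runs on both the success and the error path.
    await stagehand.close();
  }
}
```

Keeping the provider lookup in its own function makes the mapping easy to test without a browser session.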

Confidence score: 3/5

This PR requires careful review due to several evaluation quality issues and potential risks from using production websites
Score lowered due to lack of proper result validation in most tasks, reliance on production sites that may change, and some logical flaws in evaluation criteria
Pay close attention to evals/tasks/agent/kith.ts for payment form risks, evals/tasks/agent/hotel_booking.ts for validation gaps, and the formatting issue in evals.config.json
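One way to address the "lack of proper result validation" concern is to pair the agent's own success flag with an independent check on the final page state. The sketch below is an assumption about what such a check could look like, not what the PR implements; `validateResult` and `expectedUrlPart` are hypothetical names.

```typescript
interface AgentResult { success: boolean; message?: string }

// Combine the agent's self-reported success with an independent URL check.
function validateResult(
  result: AgentResult,
  finalUrl: string,
  expectedUrlPart: string,
): { success: boolean; reason?: string } {
  if (!result.success) {
    return { success: false, reason: "agent reported failure" };
  }
  if (!finalUrl.includes(expectedUrlPart)) {
    return {
      success: false,
      reason: `final URL does not contain "${expectedUrlPart}"`,
    };
  }
  return { success: true };
}
```

A URL check is only one option; on production sites that change often, a looser check may prove more stable than an exact match.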

17 files reviewed, 6 comments

evals/tasks/agent/nba_trades.ts

evals/tasks/agent/arxiv_gpt_report.ts

evals/tasks/agent/github.ts

evals/tasks/agent/kith.ts

evals/tasks/agent/wolframalpha_weight_loss.ts

greptile-apps

Greptile Summary

This review covers only the changes made since the last review (commit e71810e), not the entire PR.

The most recent changes complete a major refactoring of the agent evaluation system by centralizing agent initialization logic. The key changes include:

Agent initialization centralization: All agent evaluation functions have been updated to receive a pre-configured agent parameter instead of creating their own agent instances. This eliminates the duplicate model selection and provider mapping logic that was scattered across individual evaluation files.
Type system updates: The StagehandInitResult type in types/evals.ts now includes an agent property using ReturnType<Stagehand["agent"]>, enabling evaluation functions to access agent functionality through dependency injection.
Centralized configuration: The initStagehand.ts file now includes Computer Use Agent (CUA) model detection logic that automatically determines if a model supports computer use capabilities (checking for 'computer-use-preview' or models starting with 'claude') and creates appropriate agent configurations with proper provider mapping.
Standardized evaluation pattern: All ~20 agent evaluation files now follow a consistent pattern where they receive a pre-initialized agent, execute instructions using agent.execute(), and validate results based on agentResult.success. This creates uniformity across the evaluation suite.

The refactoring moves from a decentralized approach where each evaluation file handled its own agent setup to a centralized dependency injection pattern. This architectural change reduces code duplication, ensures consistent agent configuration across all evaluations, and provides better maintainability for global agent behavior modifications. The changes integrate with the existing evaluation framework by extending the StagehandInitResult interface and updating the initialization flow to provide agent functionality to evaluation tasks.
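The centralized detection and injection described above might look roughly like this. It is a sketch under stated assumptions: `isCuaModel`, `agentConfigFor`, `InitResult`, and `exampleTask` are hypothetical names, matching `computer-use-preview` by substring is an assumption, and returning `undefined` for non-CUA models is a placeholder for whatever fallback the real `initStagehand.ts` uses.

```typescript
interface AgentConfig { provider: "anthropic" | "openai"; model: string }

// The two checks named in the summary: 'computer-use-preview' or a
// model name starting with 'claude'.
function isCuaModel(modelName: string): boolean {
  return modelName.includes("computer-use-preview") || modelName.startsWith("claude");
}

// Assumption: non-CUA models get no CUA config; the caller decides the fallback.
function agentConfigFor(modelName: string): AgentConfig | undefined {
  if (!isCuaModel(modelName)) return undefined;
  return {
    provider: modelName.startsWith("claude") ? "anthropic" : "openai",
    model: modelName,
  };
}

// Reduced stand-ins for the injected agent and the init result.
interface AgentLike { execute(instruction: string): Promise<{ success: boolean }> }
interface InitResult { modelName: string; agent: AgentLike }

// A task receives the pre-built agent instead of constructing its own.
async function exampleTask({ agent }: InitResult): Promise<{ success: boolean }> {
  const agentResult = await agent.execute("hypothetical instruction");
  return { success: agentResult.success };
}
```

With this shape, a change to provider mapping touches one function rather than every task file.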

Confidence score: 4/5

This PR is safe to merge with good architectural improvements and consistent patterns
Score reflects clean refactoring with proper type safety, though some evaluations lack robust result validation
Pay close attention to files with time-dependent instructions or weak validation logic

29 files reviewed, 4 comments

evals/tasks/agent/sign_in.ts

evals/tasks/agent/google_flights.ts

evals/initStagehand.ts

evals/tasks/agent/google_maps_3.ts

evals/initStagehand.ts

evals/tasks/agent/all_recipes.ts

seanmcguire12 · 2025-08-19T17:54:37Z

types/evals.ts

@@ -13,6 +13,7 @@ export type StagehandInitResult = {
  sessionUrl: string;
  stagehandConfig: ConstructorParams;
  modelName: AvailableModel;
+  agent: ReturnType<Stagehand["agent"]>;


Do we have an actual agent type?

@miguelg719
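For context on the question, the snippet below contrasts a type derived with `ReturnType<Class["method"]>` against a named interface. `Client`, `DerivedAgent`, and `AgentInstance` here are hypothetical stand-ins; a later commit in this PR is titled "add agentInstance type", but its contents are not shown in this conversation, so the interface below is only a guess at the general idea.

```typescript
// Hypothetical stand-in for Stagehand; only the shape matters here.
class Client {
  agent(config?: { model: string }) {
    return {
      model: config?.model ?? "default",
      execute: async (instruction: string) => ({ success: instruction.length > 0 }),
    };
  }
}

// Derived type: whatever Client.agent() happens to return.
// It tracks the implementation automatically but has no readable name.
type DerivedAgent = ReturnType<Client["agent"]>;

// Named alternative: an explicit contract that tasks can depend on.
interface AgentInstance {
  execute(instruction: string): Promise<{ success: boolean }>;
}

type InitResult = { modelName: string; agent: AgentInstance };

const derived: DerivedAgent = new Client().agent({ model: "example-model" });
// The derived object satisfies the named interface structurally.
const result: InitResult = { modelName: derived.model, agent: derived };
```

The derived form needs no maintenance, while the named form gives clearer error messages and a stable public type.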

This PR was opened by the [Changesets release](https://github.com/changesets/action) GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated.

# Releases

## @browserbasehq/[email protected]

### Patch Changes

- [#951](#951) [`f45afdc`](f45afdc) Thanks [@miguelg719](https://github.com/miguelg719)! - Patch GPT-5 new api format
- [#954](#954) [`261bba4`](261bba4) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - add support for shadow DOMs (open & closed mode) when experimental: true
- [#944](#944) [`8de7bd8`](8de7bd8) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - Bump zod version compatibility and add pathing spec
- [#919](#919) [`3d80421`](3d80421) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - enable scrolling inside of iframes
- [#963](#963) [`0ead63d`](0ead63d) Thanks [@tkattkat](https://github.com/tkattkat)! - Properly handle images in evaluator + clean up response parsing logic
- [#961](#961) [`8422828`](8422828) Thanks [@tkattkat](https://github.com/tkattkat)! - Add more evals for stagehand agent
- [#946](#946) [`b769206`](b769206) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: unable to act on/get content from some same process iframes
- [#962](#962) [`72d2683`](72d2683) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - handle namespaced elements in xpath build step

## @browserbasehq/[email protected]

### Patch Changes

- Updated dependencies \[[`f45afdc`](f45afdc), [`261bba4`](261bba4), [`8de7bd8`](8de7bd8), [`3d80421`](3d80421), [`0ead63d`](0ead63d), [`8422828`](8422828), [`b769206`](b769206), [`72d2683`](72d2683)]:
  - @browserbasehq/[email protected]

## @browserbasehq/[email protected]

### Patch Changes

- Updated dependencies \[[`f45afdc`](f45afdc), [`261bba4`](261bba4), [`8de7bd8`](8de7bd8), [`3d80421`](3d80421), [`0ead63d`](0ead63d), [`8422828`](8422828), [`b769206`](b769206), [`72d2683`](72d2683)]:
  - @browserbasehq/[email protected]

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

greptile-apps bot reviewed Aug 12, 2025

tkattkat changed the title ~~More evals~~ Add more ore agent evals STG-653 Aug 12, 2025

tkattkat requested a review from seanmcguire12 August 12, 2025 21:37

tkattkat marked this pull request as draft August 13, 2025 00:10

tkattkat marked this pull request as ready for review August 13, 2025 00:24

greptile-apps bot reviewed Aug 13, 2025

evals/tasks/agent/sign_in.ts

evals/tasks/agent/google_flights.ts

evals/initStagehand.ts

evals/tasks/agent/google_maps_3.ts

tkattkat changed the title ~~Add more ore agent evals STG-653~~ Add more agent evals STG-653 Aug 13, 2025

tkattkat added 8 commits August 13, 2025 09:41

add more evals for agent

6ee4de7

add more evals

bac6f01

update evals config

d30d4d0

add changeset

d924370

update eval

fffe986

offload initialization of agent to the runner

81f0e1a

update type

55ef435

update steps on eval

7b4b030

tkattkat force-pushed the more-evals branch from 72a8bdf to 7b4b030 Compare August 13, 2025 16:42

seanmcguire12 reviewed Aug 13, 2025

evals/initStagehand.ts

seanmcguire12 reviewed Aug 13, 2025

evals/tasks/agent/all_recipes.ts (outdated)

miguelg719 changed the title ~~Add more agent evals STG-653~~ Add more agent evals Aug 13, 2025

tkattkat added 2 commits August 13, 2025 14:45

add more validation to evals + update default models for agent in task config

d6c6dc0

update kith eval

439494e

tkattkat requested a review from seanmcguire12 August 13, 2025 22:19

seanmcguire12 reviewed Aug 19, 2025

tkattkat added 3 commits August 19, 2025 10:55

remove type casting

e8e2eeb

add agentInstance type

358a767

remove accidental commit

a824aa6

seanmcguire12 approved these changes Aug 19, 2025

tkattkat merged commit 8422828 into main Aug 19, 2025
14 checks passed

github-actions bot mentioned this pull request Aug 14, 2025

Version Packages #945

Merged
